
v0.4.2: real Gemini eval — deep research + UX design #7

Open
RyanAlberts wants to merge 3 commits into main from claude/v0.4.2-gemini-eval

Conversation

@RyanAlberts
Owner

Summary

Working example of /run-eval against an external HTTP target. First cross-vendor eval in pmstack.

What's in here

outputs/eval-gemini-deepresearch-uxdesign-2026-04-25.yaml — 6 test cases:

Deep research (3)

  • dr-1-multisource-synthesis (P0) — compare 4 LLM agent memory approaches with named systems
  • dr-2-counterfactual (P1) — counterfactual reasoning under uncertainty
  • dr-3-citations-honesty (P0) — list 5 academic papers (the classic hallucination trap)

UX design (3)

  • ux-1-html-form (P1) — concrete mobile-first sign-up form (HTML+CSS, accessible)
  • ux-2-flow-critique (P1) — prioritize 3 UX gaps in a SaaS onboarding flow
  • ux-3-experiment-design (P1) — A/B test design with explicit assumptions

Target: gemini-2.5-pro via generativelanguage.googleapis.com. Auth via the x-goog-api-key header, sourced from the GEMINI_API_KEY env var. Judge: claude-sonnet-4-6 (a different model family, so no self-grading bias).
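
For orientation, the target block in the eval YAML looks roughly like this (a sketch, not the exact file contents): the url, method, request_template, and response_path lines match what the review below quotes, while the headers field and env-var substitution syntax are assumptions about the runner's schema.

target:
  type: http
  url: "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-pro:generateContent"
  method: POST
  headers:
    x-goog-api-key: "${GEMINI_API_KEY}"   # assumed: header map + env-var substitution
  request_template: '{"contents":[{"parts":[{"text":"{{input}}"}]}],"generationConfig":{"maxOutputTokens":2048,"temperature":0.3}}'
  response_path: ".candidates[0].content.parts[0].text"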

Verified

  • Runner hard-stops with FATAL: requires env var GEMINI_API_KEY when the variable is unset
  • With a placeholder key set, all 6 cases load cleanly at a ~48K total token estimate (well under the 200K warn threshold)

How to run (once the reviewer or user has a Gemini API key)

export GEMINI_API_KEY="your_key_here"
python3 bin/run-eval.py outputs/eval-gemini-deepresearch-uxdesign-2026-04-25.yaml \
  --judge-model claude-sonnet-4-6

To compare against Claude: duplicate the YAML, change target.type to claude-session with model: claude-sonnet-4-6, run again, diff the two summary.md files.
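
A minimal sketch of that swapped target block, assuming a claude-session target needs only a model field (the exact schema isn't shown in this PR):

target:
  type: claude-session
  model: claude-sonnet-4-6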

Cost

~$0.20 in Gemini tokens + ~$0.20 in Claude judge tokens for the full 6-case run.

Holding before merge

Will merge after the actual run produces clean output and any wording / metric tweaks land. (Easier to fix on an open PR than chase with follow-ups.)

🤖 Generated with Claude Code

RyanAlberts and others added 2 commits April 24, 2026 23:58
Working example of /run-eval against an external HTTP target. Designed
to be run by the user against their own Gemini subscription as a real
test of the http target type and as the first cross-vendor eval in
pmstack.

Six test cases:
  3 deep research — multi-source synthesis, counterfactual reasoning,
                    citation honesty (the classic hallucination trap)
  3 UX design     — concrete HTML form, prioritized flow critique,
                    A/B test design with assumptions

Target: gemini-2.5-pro via generativelanguage.googleapis.com.
Auth: x-goog-api-key header sourced from GEMINI_API_KEY env var.
Judge: claude-sonnet-4-6 (different family, no self-grading bias).

Verified the runner hard-stops cleanly when GEMINI_API_KEY is unset
("FATAL: requires env var GEMINI_API_KEY"). When set, the eval loads
6 cases at ~48K total token estimate (well under the 200K warn
threshold) — affordable smoke test against a real external API.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
outputs/eval-runs/ is per-user run output, not canonical example
material. The parent outputs/ stays tracked so the example artifacts
(Ultraplan eval, Gemini eval YAML, roadmap, verification report)
remain in the repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a new evaluation configuration for Gemini 2.5 Pro, targeting deep research and UX design capabilities. The review feedback identifies several areas for improvement: aligning the failure_modes and response_path formats with project templates for better compatibility, and adjusting metrics for specific test cases to avoid false negatives where the prompt constraints conflict with the evaluation criteria.

Comment on lines +43 to +55
failure_modes:
- id: fm-1
name: "Hallucinated facts in research synthesis"
severity: P0
- id: fm-2
name: "Refuses or stalls on time-sensitive questions instead of dating its claims"
severity: P1
- id: fm-3
name: "Generic 'best practices' answer instead of design with specifics"
severity: P1
- id: fm-4
name: "Critique that lists everything as a problem (no prioritization)"
severity: P1


Severity: high

The structure of failure_modes deviates from the templates/eval-template.yaml (lines 72-74), which defines them as a list of strings. If the runner expects the format defined in the template, this structured object approach will cause a failure or incorrect rendering in the summary output.

failure_modes:
  - "fm-1: Hallucinated facts in research synthesis (P0)"
  - "fm-2: Refuses or stalls on time-sensitive questions instead of dating its claims (P1)"
  - "fm-3: Generic 'best practices' answer instead of design with specifics (P1)"
  - "fm-4: Critique that lists everything as a problem (no prioritization) (P1)"

url: "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-pro:generateContent"
method: POST
request_template: '{"contents":[{"parts":[{"text":"{{input}}"}]}],"generationConfig":{"maxOutputTokens":2048,"temperature":0.3}}'
response_path: ".candidates[0].content.parts[0].text"


Severity: medium

The response_path is missing the $ root prefix used in the templates/eval-template.yaml (line 60). While some JSONPath implementations allow omitting it, including it ensures compatibility with standard libraries like jsonpath-ng and maintains consistency with the project's own templates.

  response_path: "$.candidates[0].content.parts[0].text"

description: "Concrete HTML/CSS for a constrained spec. Tests whether the model can produce usable code, not generic templates."
input: "Write a mobile-first sign-up form for a B2B SaaS. Fields: work email, company name, team size (1-10/11-50/51+/dropdown). Include client-side validation hints (HTML5 attributes, no JS), accessible labels, and a single primary button. Output only the HTML+CSS, in one code block."
expected_behavior: "Working HTML+CSS, mobile-first (viewport meta, max-width container), accessible labels, validation attrs. Not just a template."
metrics: ["Substance", "Structure", "Length"]


Severity: medium

The Structure metric (defined in line 66) evaluates the presence of "headers, lists, comparison tables". Since this test case explicitly requests "only the HTML+CSS, in one code block" (line 102), the model will likely receive a low score for structure even if it follows the prompt perfectly. Consider removing this metric for this specific case to avoid false negatives.

    metrics: ["Substance", "Length"]

Real-world test against Gemini 2.5 Pro hit immediate 429s on every
call (free-tier quota exhausted). The runner correctly captured the
errors and refused to fake scores. Two improvements that came out of
debugging:

bin/run-eval.py
- target.delay_between_cases_sec: optional sleep between cases for
  rate-limit respect on http targets
- 429 errors now emit a clear WARN telling the user to check their
  quota / try a smaller model / set a delay
- Captures the full HTTP error body (300 chars) into evidence so
  diagnosis doesn't require re-reading raw network logs

eval YAML
- Switched from gemini-2.5-pro to gemini-2.0-flash (more permissive
  free-tier limits)
- Added delay_between_cases_sec: 6
- Documented why in description

Note: the actual Gemini comparison didn't produce real scores in this
run (quota exhausted on user's API key). The eval will need to be
re-run after the user's quota resets or with a key that has access.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
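
Roughly, the revised target block described in that commit would look like this (a sketch: the commit confirms only the model switch and the new delay field, and it assumes the model name slots into the same URL pattern):

target:
  type: http
  url: "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent"
  method: POST
  delay_between_cases_sec: 6   # sleep between cases to stay under free-tier rate limits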